Quantium Data Analytics Virtual Internship

Assignment 1 : Data preparation and Customer Analytics.

The objective of this task is to help the Category Manager for Chips in a grocery store, better understand the types of customers who purchase Chips and their purchasing behaviour within the region.

To accomplish this task, we are provided with two datastes, 1 Year of potato chip transactions and Customer details.

We are going to prepare these datasets for analysis by cleaning and merging them together. We then proceed to creating extra features that we might deem necessary for the analysis.

Once we have the final dataset ready, we can proceed to analysing the data with the goal of understanding which segments are the biggest drivers of sales and what are their purchasing preferences to create recomendations for the category manager.

Data preparation

Load Libraries and Data

Transactions

We have 264,836 transactions from July 2018 to July 2019.

We need to change columns number 1 to 5 as categorical.

Let’s check the products for consistency i.e. we don’t have any other products in there.

Looks like Salsa repeats a lot, might be salsa falvoured chips or actual salsa. Lets confirm.

After some googling looks like these products are not chips and indeed salsa, lets drop all rows with these products.

Looks like we have solved all data quality issues for the transacitons data set.
Now on to the customers data set.

Customers

The customer dataset looks good, now we can procced to merging the two datasets.

Feature Generation

The goal in this section is to generate new columns (features) from the existing ones to enrich our dataset.

Price of each item

Brand of each product

Looks like the brand names are not standardized. We have errors like : Doritos and Dorito. We can manualy map the names as they are not many.

Cool, now we have added a brand column to our dataset by extracting it from the product names.

Size of each packet

We have succesfully extracted and converted the weight of each packet to a usable format.

Packet size segmentation

We can use the packet size column to generate size segments, i will divide the sizes in to small, medium, big and verry big.

The box plot shows us that most of the products weigh between 150g and 175g, we will use that as our middle point to create our size category. Lets get to coding.

As the box plot showed we can see that we have 205,567 units sold that weigh between 110 and 175g, we can see already there is a preference for that size range. We will confirm that later on in the analysis.

Day of Week and Month

Exploratory Data Anlysis

Now that we have cleaned our data we are ready to start analysing.

We know that the client is intrested in understanding her customers purchasing patterns and finding out which customer groups to target to increase her sales. To get a better understanding of the object of analysis which in this case is the customers lets rank the segments on the following metrics:

  1. Number of customers in each segment.
  2. Percent of total revenue.
  3. Revenue per customer.
  4. Packets Bought per Customer.
  5. Average Chip Price by Segment.

After getting a clear picture from these metrics we can dig deeper. Lets get on with it.

Biggest Segments:

  1. Young Singles/Couples (Mainstrean) = 11.1%
  2. Retirees (Mainstream) = 8.92%
  3. Older Singles/Couples (All) = 20%

Highest Revenue:

  1. Old Families (Budget) = 8.68%
  2. Young Singles/Couples (Mainstream) = 8.19%
  3. Retirees (Mainstream) = 8.06%

Revenuee per Customer:

  1. Older Families = 33.88
  2. Young Families = 32.38
  3. Older Singles/Couples = 25.98

Average Number of Packets bought:

  1. Older Families = 9.2 ~ 9
  2. Young Families = 8.7 ~ 9
  3. Older Singles/Couples = 6.73 ~ 7

Average Price Per Chip:

  1. Young Singles/Couples (Mainstream) = 4.06
  2. Midage Singles/Couples (Mainstream) = 3.99
  3. New families (Mainstream & Budget) + Retirees (Budget & Premium) = 3.90

Our biggest segment is Young Singles/Couples (Mainstream), we should expect that since they are the majority they should bring in the most revenue but interestingly enough the highest proportion of revenue (8.68%) comes from Older Families (budget) which make up 6.47% of customers.
Young Singles/Couples (Mainstream) that make up 11.1% of all customers were responsible for 8.19% of total revenue the second best. This can be explained by the fact that Older and Younger families buy more packets, on average about 9 compared to about 5 for Young Singles/Couples (Mainstream).
On the other hand, Young Singles/Couples (Mainstream) pay more per chip, on average 4.06 while Older Families (budget) spend on average 3.74.
We can conclude that Young Singles/Couples (Mainstream) are more prone to impulse purchases and since they have less people in the household, they purchase the more expensive brands at less quantity.
On the contrary Older Families (budget) try to buy cheaper brands but more quantity.

We can visualise that relationship by plotting average price vs average quantity.
But before we have to test if the Average price difference is statistically significant before we start assuming it in our analysis.

We have selected Midage and Young singles/couples both mainstream as our chip fans. Lets test these sample to see if the differance is significant if yes we will dig a bit deeper to see what might cause the above average spend. Lets test Midage singles/couples & Young singles/couples Mainstream

Lets conduct a 2 sample t-test on the control population (all segments except our chip fans) and our chip fans.

2 sample t-test assums that both data sets must have almost the same mean and they are identically and idependentaly distributed. We can check for the mean and skew of the datasets to see if our datasets pass those assumptions

Looks like we have the same distributions for both data sets. Lets proced with the test.\ Our null hypothesis : Both groups have no statisticaly signifacant differance in average ammount of sales.

We have a pvalue much lower than our alpha of 0.05 so we can conclude that the diferance of average price is not due to chance. Now we can assume that these segments are more likly to buy more expensive brands.
Lets see the relationship between quantity sold and average price.

The above scatter plot shows some interesting insights, as we thought we see families tend to purchase more quantity but tend to buy less expensive chips but this is noted only for the Older and Young families, New families do not follow this trend, they show preference for quality over quantity indicated by lower quantity and higher average price.

Also, interestingly we see that Young Singles/Couples and Midge Singles/Couples are willing to pay more per chip when compared with their premium and budget counterparts. We can note that Sinlges/Couples buy more as they get older, note that Young S/C buy on average less than Midage S/C and so on, maybe due to increased health awareness from younger demographics.

Let’s get to know better the segments that buy more than the average and pay more than average with the intent of finding ways to keep these segments buying and retaining their business.
Quantity:

Price:

Its in the clients intrest to retain and serve well these segments, for this the store should keep stocked the brands and sizes these segments tend to buy the most. Lets figure out which those are.

Older Families & Young Families

We are going to create a affinity table for brand and packet size for each of the 4 segments. How are we going to make this affinity table? Well if you think about it we can only get an understanding about prefrences towards a specific brand or size by comparing between what proportion a brand ocupies in the overall total quantity.

Example : Lets take in to consideration our Older families segment, We will compare how much the brand 'Kettle' contributes to total quantity sold between our target segment and the population (Everyone - our target). In this example we can imagine something like this:
Brand : Kettle
Target %: 0.2
Population %: 0.16
Affinity = 0.2/0.16 = 1.25. We can say that our target segment is 25% more likley to purchase kettle over the population.
Lets code the following steps in a function so we can apply them to all the 4 segments with out the need to code it each time.

Older Families Brand and Size Prefrences

Older families are about 30% more likley to buy NCC, Burger Rings, WW & RRD.
This segment is 30% more likley to buy packets sized between 180 - 220grams

Young Families Brand and Size Preferences

While Young families are amlost 30% more likly to buy Sunbites, French Fries and CCs. What about size ?

Young families are 30% more likely to buy 90g packets. Smaller packets might be due to the children in the family bieng smaller.

What brands sell 90g packets?

We can conclude that young families are more likly to buy brands that offer smaller packets probably because they buy chips for children as a snack.

Midage Singles/Couples & Young Singles/Couples (Mainstream)

Lets do the same for these segments.

Midage S/C Brand and Size Preference

Young S/C Brand and Size preference

It seems like we have uncovered prefrences in all the segments, lets list the top prefrences for brand and size:

Older Families:
Brand: 30% more likley to buy NCC & Burger Rings.
Size: 30% more likley to buy packets sized 180g, 200g & 220g.

Young Families:
Brand: 30% more likley to buy Sunbites & French fries.
Size: 30% more likley to buy smallest packs 90g to be exact.

Midage S/C:
Brand: 16% more likley to buy Twisties and 14% to buy Kettle.
Size: 22% more likely to buy 270g packs and 19% 330g.

Young S/C Brand: 23% more likley to buy Tyrrells and 22% Twisties.
Size: 27% more likley to buy 270g packs and 26% 380g

We now have the stats for our selected segments, What do these brands have in commun ? Can we see why these segments have these prefrences ? Lets see what are the commun properties between these brands.

Let’s recap what we’ve found! Sales have mainly been due to Budget - older families, Mainstream - young singles/couples, and Mainstream retirees shoppers. We found that the high spend in chips for mainstream young singles/couples and retirees is due to there being more of them than other buyers.

Mainstream, midage and young singles and couples are also more likely to pay more per packet of chips. This is indicative of impulse buying behaviour.

We’ve also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category’s performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near segments where young singles and couples frequent more often to increase visibilty and impulse behaviour. Quantium can help the Category Manager with recommendations of where these segments are and further help them with measuring the impact of the changed placement. We’ll work on measuring the impact of trials in the next project.

Conclusion

What have we found out?

Most sales come from Old Families (Budget) Young Singles/Couples (Mainstream) & Retirees (Mainstream). Older Families, Young Families & Older Singles/Couples bring in most revenue per customer. Young Singles/Couples (Mainstream) & Midage Singles/Couples (Mainstream) pay more per chip packet. Older families and Young families buy more packets of chips on average. Older and Young families are about 30% more likley to buy brands that cost under 3.00, young families are more likley to buy small packets. Midage and Young Singles/Couples are about 20% more likley to buy more expensive brands, above 4.00.

Our recommendations:

  1. Increase small sized pack offerings to increase the number of families buying chips and to keep them returning.
  2. Positions the smaller and cheaper packs closer to other items families frequntly buy to increase purchases.
  3. Create stands that only house chips bellow 3.00 and other with chips above 4.00 at strategic points in the store as to make the customers go buy other items and probably add them to their shopping basket.